1.  INTRODUCTION


This  project  concerns  the  design  and  analysis  of  algorithms  to  be  run  in  a 
processor-rich  environment.  We  focus  primarily  on  algorithms  that  require  no  global 
control  and  that  can  be  run  on  systems  with  only  local  connections  among  processors. 
We  investigate  the  properties  of  these  algorithms  both  theoretically  and  experimentally. 
The  experimental  work  has  been  done  on  the  ZMOB,  a  parallel  computer  operated  by 
the  Laboratory  for  Parallel  Computation  of  the  Computer  Science  Department  at  the 
University  of  Maryland,  although  recently  we  have  gained  access  to  a  BBN  Butterfly 
computer  as  well. 

The ZMOB consists of 128 processors that communicate by message passing over
a communications network providing a complete set of connections between
processors. The start-up time for interprocessor communication, the per-word
transmission overhead, and the floating-point computation time are all of the
same order of magnitude.

It  is  important  to  be  precise  about  how  we  use  the  ZMOB  in  our  research.  What 
we  do  not  do  is  to  investigate  algorithms  for  the  ZMOB  itself.  Instead  we  use  the  fact 
that  the  ZMOB  appears  to  be  a  completely  connected  network  to  simulate  various 
locally  connected  networks  of  processors.  Thus  we  can  investigate,  in  a  realistic  setting, 
the  effects  on  our  algorithms  of  various  processor  interconnections. 

Our activities may be divided into four categories: algorithms, software
development, theoretical analysis, and experimental analysis.

To  give  our  work  direction,  we  have  focused  on  dense  and  sparse  problems  from 
numerical  linear  algebra.  We  discuss  in  this  summary  the  research  projects  that  we  have 
pursued  under  this  grant  support  over  the  past  year. 


2.  Summary  of  Work 

Our  activities  have  ranged  from  theoretical  analysis  to  algorithmic  design  and 
software  development.  We  summarize  this  work  in  the  following  sections.  For  details, 
consult  the  annotated  list  of  references  in  Appendix  A. 

2.1  Data-Flow  Algorithms  and  Domino

We have based most of our work in this area on the notion of a data-flow
algorithm. The computations in a data-flow algorithm are done by independent
computational nodes, which cycle between requesting data from certain nodes,
computing, and sending data to certain other nodes. More precisely, the nodes
lie at the vertices of a directed graph whose arcs represent lines of
communication. Each time a node sends data to another node, the data is placed
in a queue on the arc between the two nodes. When a node has requested data
from other nodes, it is blocked from further execution until the data it has
requested arrives at the appropriate input queues. An algorithm organized in
this manner is called a data-flow algorithm because the times at which nodes
can compute are controlled by the flow of data between nodes.
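
The node-and-arc model just described can be sketched in a few lines of Python. This is only an illustration, not Domino code: the node names (`source`, `square`, `accumulate`), the use of threads, and the `None` end-of-stream marker are assumptions made for the example. Each arc is a queue, and a node that requests data blocks until the data arrives.

```python
import queue
import threading

def source(out_arc, values):
    """Node with no inputs: sends each value along its outgoing arc."""
    for v in values:
        out_arc.put(v)
    out_arc.put(None)  # end-of-stream marker (an assumption of this sketch)

def square(in_arc, out_arc):
    """Node that requests one item, computes, and sends the result onward."""
    while True:
        v = in_arc.get()  # blocks until the requested data is on the arc
        if v is None:
            out_arc.put(None)
            return
        out_arc.put(v * v)

def accumulate(in_arc, result):
    """Node that sums everything it receives and records the total."""
    total = 0
    while True:
        v = in_arc.get()
        if v is None:
            result.append(total)
            return
        total += v

def run_pipeline(values):
    """Wire three nodes together with two arcs and run them to completion."""
    arc1, arc2 = queue.Queue(), queue.Queue()
    result = []
    nodes = [
        threading.Thread(target=source, args=(arc1, values)),
        threading.Thread(target=square, args=(arc1, arc2)),
        threading.Thread(target=accumulate, args=(arc2, result)),
    ]
    for t in nodes:
        t.start()
    for t in nodes:
        t.join()
    return result[0]
```

No node consults a global clock or a central controller; the order of computation is fixed entirely by when data appears on the arcs.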

Data-flow algorithms are well suited for implementation on networks of
processors that communicate by message passing. Each node in a computational
network is regarded as a process residing on a fixed member of a network of
processors. We allow more than one node on a processor. Since many nodes will
be performing essentially the same functions, we allow nodes that share a
processor also to share pieces of reentrant code, which we call node programs.
Each processor has a resident node communication and control system to receive
messages from and transmit messages to other processors and to awaken nodes
when their data has arrived.

Data-flow algorithms have a number of advantages.

1.  The approach eliminates the need for global synchronization.

2.  Parallel matrix algorithms, including all algorithms for systolic arrays,
have data-flow implementations.

3.  Data-flow algorithms can be coded in a high-level sequential programming
language, augmented by two communication primitives for sending
and receiving data.

4.  Data-flow computations can be supported by a very simple node
communication and control system.

5.  The approach allows the graceful handling of missized problems, since
several nodes can be mapped onto one processor.

6.  By mapping all nodes in a data-flow algorithm onto a single processor,
one can debug parallel algorithms on an ordinary sequential processor.
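
The last advantage, running every node on one sequential processor, can be illustrated with a small scheduler. The sketch below is hypothetical and is not the Domino control system. Nodes are written as Python generators, a node requests data by yielding the arc it wants to read, and the `pick` argument (an assumption of this example) chooses which ready node runs next.

```python
import random
from collections import deque

def producer(out_arc, values):
    """Node that sends each value, then an end marker (None)."""
    for v in values:
        out_arc.append(v)
        yield None              # give up the processor without requesting data
    out_arc.append(None)

def adder(in_a, in_b, out_arc):
    """Node that requests one item from each input arc and sends their sum."""
    while True:
        a = yield in_a          # blocked until in_a is non-empty
        b = yield in_b          # blocked until in_b is non-empty
        if a is None or b is None:
            out_arc.append(None)
            return
        out_arc.append(a + b)

def collector(in_arc, results):
    """Node that stores everything it receives."""
    while True:
        v = yield in_arc
        if v is None:
            return
        results.append(v)

def run(nodes, pick):
    """Run all nodes on one processor; pick(ready) selects the next node."""
    waiting = {node: None for node in nodes}   # arc each node is blocked on
    live = list(nodes)
    while live:
        ready = [n for n in live
                 if waiting[n] is None or len(waiting[n]) > 0]
        if not ready:
            raise RuntimeError("deadlock: every live node is blocked")
        node = pick(ready)
        arc = waiting[node]
        value = arc.popleft() if arc is not None else None
        try:
            waiting[node] = node.send(value)   # resume until the next request
        except StopIteration:
            live.remove(node)

def simulate(xs, ys, pick):
    """Add two streams elementwise under the given scheduling policy."""
    a, b, c = deque(), deque(), deque()
    results = []
    run([producer(a, xs), producer(b, ys), adder(a, b, c),
         collector(c, results)], pick)
    return results
```

Calling `simulate` with different policies, for example `lambda ready: ready[0]` or `random.Random(0).choice`, produces the same output stream, which is convenient when debugging on a sequential machine.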


Because of the conceptual convenience and practical utility of the data-flow
approach, and because of the absence of any standard for writing transportable
algorithms for parallel machines, we have implemented these ideas in a node
communication and control system called Domino. We have documented the system
and provided examples of its use in a technical report. The system currently
runs on the ZMOB, on VAXes under Unix or VMS, on Sun workstations, and on IBM
PCs. A Butterfly implementation is underway. The code is currently being used
for numerical analysis and for neural network studies at Maryland. The system
has been very valuable to us in our research, and it will be used in a course
on parallel computation to be taught next fall at Maryland. We have already
received numerous inquiries from potential users in industry and academia, and
we will make Domino available over the Arpanet through Netlib at Argonne
National Laboratory.


2.2  Theoretical  Developments 

Work  has  been  done  in  the  design  of  parallel  architectures  and  in  the  analysis  of 
parallel  algorithms. 

Our work on the determinacy of our data-flow model for parallel computation led
us to propose a modification of the design of systolic arrays in order to
eliminate the need for global synchronization. Each cell in the array is
augmented by a feedback circuit so that data is sent from one cell to another
only when the receiver is ready to process it. We call such networks systaltic
arrays.
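
The feedback idea can be mimicked in software by a ready/data handshake between two cells. The following is a loose analogy under stated assumptions, not a description of the proposed circuit: the `ready_arc` queue stands in for the feedback line, and the function names are invented for this sketch.

```python
import queue
import threading

def sender(data_arc, ready_arc, items, stats):
    """Cell that sends an item only after the receiver has signalled ready."""
    for x in items:
        ready_arc.get()                 # wait for the feedback signal
        data_arc.put(x)
        stats["max_in_flight"] = max(stats["max_in_flight"], data_arc.qsize())
    ready_arc.get()
    data_arc.put(None)                  # end marker (assumption of the sketch)

def receiver(data_arc, ready_arc, out):
    """Cell that announces readiness, then accepts exactly one item."""
    while True:
        ready_arc.put(True)             # feedback: ready for the next item
        x = data_arc.get()
        if x is None:
            return
        out.append(x)

def transfer(items):
    """Move items from sender to receiver; report the most items ever queued."""
    data_arc, ready_arc = queue.Queue(), queue.Queue()
    out, stats = [], {"max_in_flight": 0}
    cells = [
        threading.Thread(target=sender,
                         args=(data_arc, ready_arc, items, stats)),
        threading.Thread(target=receiver, args=(data_arc, ready_arc, out)),
    ]
    for t in cells:
        t.start()
    for t in cells:
        t.join()
    return out, stats["max_in_flight"]
```

Because a send happens only in response to a ready signal, at most one item is ever waiting between the two cells, and no global clock is needed to keep them in step.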

David C. Fisher completed a thesis, partially supported by this grant, that
studies the complexity of various tasks in matrix computation, assuming that
processors perform computations so fast that the communication delay in sending
data between physically distant processors is significant. Lower bounds on
execution time were obtained, and optimal algorithms were derived for several
problems.


2.3  Algorithm  Design,  Analysis,  and  Testing 

The chief difficulty with the data-flow approach is that the behavior of the
algorithms cannot be analyzed purely from the local viewpoint of the node
programs. This is one reason for supplementing theory with experiment.

This year, we devised a number of new parallel algorithms. For dense matrices we
developed algorithms for computing the QR factorization of a matrix and a
parallel version of the QR algorithm for computing eigenvalues. For sparse
matrices, we are working on simultaneous iteration methods for eigenvalues and
block conjugate gradient algorithms for solving linear systems.


3.  Summary 

Four papers supported under this grant appeared in refereed journals during this
year, and two were accepted for publication. Invited talks were given at
universities and at conferences whose themes ranged from parallel processing to
statistics and mathematical programming. One graduate student completed his
dissertation and two others made substantial progress. The Domino software has
been documented and prepared for distribution.

Our  work  resulted  in  a  collection  of  parallel  algorithms  for  matrix  computations,  a 
data-flow  operating  system  to  support  experiments,  and  theoretical  investigation  into 
complexity  and  determinacy  issues  in  parallel  matrix  computations. 


Appendix 

Accomplishments  under  Grant  AFOSR  82-0078 


I.  Technical  Reports 


(1)  G. W. Stewart, Computing the CS Decomposition of a Partitioned Orthonormal Matrix,
TR-1159, May, 1982.

This paper describes an algorithm for simultaneously diagonalizing by orthogonal
transformation the blocks of a partitioned matrix having orthonormal columns.


(2)  G.  W.  Stewart,  A  Note  on  Complex  Division,  TR-1206,  August,  1982.

An  algorithm  (Smith,  1962)  for  computing  the  quotient  of  two  complex  numbers  is 
modified  to  make  it  more  robust  in  the  presence  of  underflows. 
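
For reference, the unmodified algorithm of Smith (1962) can be written as below. This sketch does not include the underflow safeguards that the report adds; the function name and argument order are chosen here for illustration.

```python
def smith_divide(a, b, c, d):
    """Return (re, im) of (a + ib) / (c + id) using Smith's 1962 scaling.

    Dividing through by the larger of |c| and |d| avoids forming c*c + d*d,
    which can overflow even when the quotient itself is representable.
    """
    if c == 0 and d == 0:
        raise ZeroDivisionError("complex division by zero")
    if abs(d) <= abs(c):
        r = d / c
        den = c + d * r
        return (a + b * r) / den, (b - a * r) / den
    r = c / d
    den = c * r + d
    return (a * r + b) / den, (b * r - a) / den
```

With operands near 1e300, the textbook formula overflows in `c*c + d*d`, while `smith_divide(1e300, 1e300, 1e300, 1e300)` stays within range.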


(3)  D.  P.  O’Leary,  Solving  Sparse  Matrix  Problems  on  Parallel  Computers,  TR-1234, 
December,  1982. 

This  paper  has  a  dual  character.  The  first  part  is  a  survey  of  some  issues  and  ideas  for 
sparse  matrix  computation  on  parallel  processing  machines.  In  the  second  part,  some  new 
results  are  presented  concerning  efficient  parallel  iterative  algorithms  for  solving  mesh 
problems  which  arise  in  network  problems,  image  processing,  and  discretization  of  partial 
differential  equations. 


(4)  G. W. Stewart, A Jacobi-like Algorithm for Computing the Schur Decomposition of a
Non-Hermitian Matrix, TR-1321, August, 1983.

This  paper  describes  an  iterative  method  for  reducing  a  general  matrix  to  upper  triangular 
form  by  unitary  similarity  transformations.  The  method  is  similar  to  Jacobi’s  method  for 
the  symmetric  eigenvalue  problem  in  that  it  uses  plane  rotations  to  annihilate  off-diagonal 
elements,  and  when  the  matrix  is  Hermitian  it  reduces  to  a  variant  of  Jacobi’s  method. 
Although  the  method  cannot  compete  with  the  QR  algorithm  in  serial  implementation,  it 
admits  of  a  parallel  implementation  in  which  a  double  sweep  of  the  matrix  can  be  done  in 
time  proportional  to  the  order  of  the  matrix. 


(5)  D. P. O’Leary and G. W. Stewart, Data-Flow Algorithms for Matrix Computations,
TR-1386, January, 1984.

In this work we develop some algorithms and tools for solving matrix problems on parallel
processing computers. Operations are synchronized through data-flow alone, which makes
global synchronization unnecessary and enables the algorithms to be implemented on
machines with very simple operating systems and communications protocols. As examples,
we present algorithms that form the main modules for solving Lyapunov matrix equations.
We compare this approach to wavefront array processors and systolic arrays, and note its
advantages in handling missized problems, in evaluating variations of algorithms or
architectures, in moving algorithms from system to system, and in debugging parallel
algorithms on sequential machines.


(6)  G. W. Stewart, W. F. Stewart, and D. F. McAlister, A Two Stage Iteration for Solving
Nearly Uncoupled Markov Chains, TR-1384, 1984.

This paper presents and analyzes a parallelizable algorithm for solving Markov chains that
arise in queuing models of loosely coupled systems.


(7)  Dianne P. O’Leary and Robert E. White, Multi-Splittings of Matrices and Parallel
Solution of Linear Systems, TR-1362, December, 1983.

We present two classes of matrix splittings and give applications to the parallel iterative
solution of systems of linear equations. These splittings generalize regular splittings and
P-regular splittings, resulting in algorithms which can be implemented efficiently on
parallel computing systems. Convergence is established, rate of convergence is discussed,
and numerical examples are given.
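
A small sketch may help fix the idea of a multisplitting iteration. It assumes the usual form x_new = sum over l of D_l M_l^{-1} (N_l x + b), where A = M_l - N_l and the D_l are nonnegative diagonal weights summing to the identity. The particular choice below (two triangular splittings with equal weights) is an illustration chosen for this example and is not taken from the report.

```python
def forward_solve(A, rhs):
    """Solve M y = rhs where M is the lower triangle of A (with diagonal)."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * y[j] for j in range(i))
        y[i] = (rhs[i] - s) / A[i][i]
    return y

def backward_solve(A, rhs):
    """Solve M y = rhs where M is the upper triangle of A (with diagonal)."""
    n = len(rhs)
    y = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (rhs[i] - s) / A[i][i]
    return y

def multisplitting_solve(A, b, sweeps=60):
    """Iterate with two splittings of A and weights D1 = D2 = I/2."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        # Splitting 1: M1 = lower triangle, so N1 x + b = b - (strict upper) x.
        rhs1 = [b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
                for i in range(n)]
        # Splitting 2: M2 = upper triangle, so N2 x + b = b - (strict lower) x.
        rhs2 = [b[i] - sum(A[i][j] * x[j] for j in range(i))
                for i in range(n)]
        # The two solves are independent and could run on separate processors.
        y1 = forward_solve(A, rhs1)
        y2 = backward_solve(A, rhs2)
        x = [0.5 * (u + v) for u, v in zip(y1, y2)]
    return x
```

Only the final weighted combination requires the processors to exchange results, which is what makes iterations of this kind attractive on parallel machines.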


(8)  David C. Fisher, In Three-Dimensional Space, the Time Required to Add N Numbers is
O(N^{1/4}), TR-1431, August, 1984.

How quickly can the sum of N numbers be computed with sufficiently many processors?
The traditional answer is t = O(log N). However, if the processors are in R^d (usually
d ≤ 3), addition time and processor volume are bounded away from zero, and
transmission speed and processor length are bounded, then t ≥ O(N^{1/(d+1)}).
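
One way to see where the exponent 1/(d+1) comes from is a counting argument. The following is a heuristic sketch under the stated assumptions, not necessarily the proof given in the report. The symbols c (bound on transmission speed), v (lower bound on processor volume), tau (lower bound on addition time), and C_d (a constant depending only on d and the bound on processor length) are introduced here for illustration.

```latex
% Only processors within distance ct of the output can influence the
% result by time t, and each occupies volume at least v, so their number is
\[
  P(t) \;\le\; \frac{C_d\,(ct)^d}{v}.
\]
% Each processor completes at most t/\tau additions, and N-1 are needed:
\[
  N - 1 \;\le\; P(t)\,\frac{t}{\tau} \;\le\; \frac{C_d\,c^d}{v\,\tau}\;t^{\,d+1}.
\]
% Solving for t:
\[
  t \;\ge\; \left(\frac{v\,\tau\,(N-1)}{C_d\,c^d}\right)^{1/(d+1)} .
\]
```

For d = 3 the right-hand side grows like N^{1/4}, in agreement with the title of the report.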


(9)  Dianne P. O’Leary, G. W. Stewart, On the Determinacy of a Model for Parallel
Computation, TR-1456, November, 1984. (Obsolete: see TR-1553)

In this note we extend a model of Karp and Miller for parallel computation. We show
that the model is deterministic, in the sense that under different scheduling regimes each
process in the computation consumes the same input and generates the same output.
Moreover, if the computation halts, the final state is independent of scheduling.


(10)  Dianne  P.  O’Leary,  Systolic  Arrays  for  Matrix  Transpose  and  Other  Reorderings,  TR-1481, 
March,  1985. 

In this note, a systolic array is described for computing the transpose of an n × n matrix
in time 3n - 1 using n^2 switching processors and n^2 bit buffers. A one-dimensional
implementation is also described. Arrays are also given to take a matrix in by rows and
put it out by diagonals, and vice versa.


(11)  Dianne P. O’Leary, G. W. Stewart, Assignment and Scheduling in Parallel Matrix
Factorization, TR-1486, April, 1985.

We consider in this paper the problem of factoring a dense n × n matrix on a network
consisting of P MIMD processors when the network is smaller than the number of
elements in the matrix (P < n^2). The specific example analyzed is a computational network
that arises in computing the LU, QR, or Cholesky factorizations. We prove that if the
nodes of the network are evenly distributed among processors and if computations are
scheduled by a round-robin or a least-recently-executed scheduling algorithm, then optimal
order of speed-up is achieved. However, such speed-up is not necessarily achieved for other
scheduling algorithms or if the computation for the nodes is inappropriately split across
processors, and we give examples of these phenomena. Lower bounds on execution time
for the algorithm are established.


(12)  Dianne  P.  O’Leary,  G.  W.  Stewart,  From  Determinacy  to  Systaltic  Arrays,  TR-1553, 
August,  1985.

In  this  paper  we  extend  a  model  of  Karp  and  Miller  for  parallel  computation.  We  show 
that  the  extended  model  is  deterministic,  in  the  sense  that  under  different  scheduling 
regimes  each  process  in  the  computation  consumes  the  same  input  and  generates  the  same 
output.  Moreover,  if  the  computation  halts,  the  final  state  is  independent  of  scheduling. 
The  model  is  applied  to  the  generation  of  precedence  graphs,  from  which  lower  time 
bounds  may  be  deduced,  and  to  the  synchronization  of  systolic  arrays  by  local  rather  than 
global  control. 


(13)  David  C.  Fisher,  Matrix  Computation  on  Processors  in  One,  Two,  and  Three  Dimensions, 
TR-1556,  August,  1985. 

Suppose a problem is to be solved on a d-dimensional parallel processing machine. Assume
that transmission speed is finite. Under this and other "real world" assumptions, if a
problem requires I inputs, K outputs and T computations, then time required to solve the
problem is greater than or equal to O(max(I^{1/d}, K^{1/d}, T^{1/(d+1)})). Algorithms for certain
matrix computations are developed. The problems are divided into atoms. The algorithms
are  described  and  analyzed  with  the  use  of  step  and  processor  assignment  functions. 
These  assign  each  atom  to  a  step  and  a  processor.  Here  is  a  table  showing  the  time  for 
algorithms  presented  in  this  paper: 


[Table: times for summation of k numbers, multiplication of a k × k matrix by a
k-vector, multiplication of two k × k matrices, and Cholesky factorization of a
k × k matrix. Apart from an O(k) entry for each of the two multiplication
problems, the entries are illegible.]

Except for matrix multiplication in three dimensions, these times are a constant multiple
of the lower bounds. Programs are given that will execute these algorithms on an
appropriate parallel processing machine.


(14)  D. P. O’Leary, G. W. Stewart, R. van de Geijn, DOMINO: A Message Passing Environment
for Parallel Computation, TR-1648, April, 1986.


This  report  is  a  description  of  DOMINO,  a  system  to  coordinate  computations  on  a  net¬ 